Linear Models

Linear Model

The Linear Model is one of the simplest models; we use this section as a starting point to introduce some common ideas in Data Science.

Least Square Problem

Suppose we have $n$ data points $x_1,\cdots,x_n$ in $\mathbb{R}^p$. Denote by $$\mathbf{X} \in \mathbb{R}^{n\times p }$$ the matrix with entries $$\mathbf{X}_{i,j}=(x_i)_j$$ For each point $x_i$, there is a corresponding output $y_i$; we denote the output vector by $$\mathbf{y} := \begin{pmatrix} y_1\\\vdots\\y_n \end{pmatrix}$$ Denote the design matrix by $$X:=[\mathbf{X}|\mathbb{1}] \in \mathbb{R}^{n\times (p+1) }$$ The goal of a least squares problem is to find the vector $w\in \mathbb{R}^{p+1}$ that minimizes the Sum of Squared Errors $$\mathrm{SSE}(w):=\|X w - \mathbf{y}\|_2^2 $$
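As a concrete reference point, here is a minimal numerical sketch of this setup on synthetic data (all names are illustrative; the later sketches in this section reuse the arrays defined below):

import numpy as np

# synthetic running example
rng = np.random.default_rng(0)
n, p = 50, 3
X_raw = rng.normal(size=(n, p))            # data matrix X in R^{n x p}
X = np.column_stack([X_raw, np.ones(n)])   # design matrix [X | 1]
w_true = np.array([2.0, -1.0, 0.5, 3.0])   # last entry plays the role of the intercept
y = X @ w_true + rng.normal(scale=0.1, size=n)

# least squares solution minimizing SSE(w) = ||Xw - y||^2
w_hat, residual_sse, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(w_hat)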

Geometric Approach

Projection Matrix

We start with the geometric approach, the one that I find most natural. Recall that we have the following minimization problem, $$\min_{w\in \mathbb{R}^{p+1}} \|X w - \mathbf{y}\|_2^2$$ The object $Xw$ lives in the column space of $X$; thus to minimize the above squared distance, we only need to require $Xw= \widehat{\mathbf{y}}$, where $ \widehat{\mathbf{y}}$ is the projection of $\mathbf{y}$ onto the column space of $X$. In other words, recalling that the projection can be done via the projection matrix $X( X^T X)^{-1}X^T$ (assuming $X$ has full column rank, so that $X^TX$ is invertible), we need $$Xw= \widehat{\mathbf{y}} =X( X^T X)^{-1}X^T \mathbf{y}$$ which implies $$ w = ( X^T X)^{-1}X^T \mathbf{y}$$ Since $$\hat{\mathbf{y}} = Xw =X( X^T X)^{-1}X^T \mathbf{y} $$ the projection matrix $$P = X( X^T X)^{-1}X^T$$ is also often called the hat matrix: it puts the hat on $\mathbf{y}$.
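A quick numerical check of this projection picture, reusing the synthetic X and y defined above (and assuming X has full column rank):

# hat matrix P = X (X^T X)^{-1} X^T
P = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = P @ y                            # projection of y onto the column space of X
w = np.linalg.solve(X.T @ X, X.T @ y)    # w = (X^T X)^{-1} X^T y
print(np.allclose(X @ w, y_hat))         # True: Xw is exactly the projection of y
print(np.allclose(P @ P, P))             # True: P is idempotent, as a projection matrix should be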

Correlations, R Squared and F-statistics

Correlations
Correlations
Denote the $i$-th column of $X$ by $X_i$. The standard definition of correlation can be rewritten as $$ \begin{aligned} \mathtt{Corr}\left(\mathbf{y}, X_{i}\right):&=\frac{\sum_j(\mathbf{y}_j-\overline{\mathbf{y}}) (X_{ji}-\overline{X_{i}})}{\sqrt{\sum_j( X_{ji}-\overline{X_{i}})^2\sum_j(\mathbf{y}_j-\overline{\mathbf{y}})^2}} \\ &=\frac{\left(\mathbf{y}-\overline{\mathbf{y}}\right)\cdot\left( X_{i}-\overline{X_{i}}\right)}{\|X_i-\overline{X_{i}}\|\,\|\mathbf{y}-\overline{\mathbf{y}}\|}\\ &=\cos \left(\angle\left(\mathbf{y}-\overline{\mathbf{y}}, X_{i}-\overline{X_{i}}\right)\right) \end{aligned} $$
$R^2$
Similarly, using the decomposition $$SST=SSE+SSR$$ (which holds when the model includes an intercept), we have $$R^{2}:=\frac{SSR}{SST}=\cos ^{2}(\angle(\mathbf{y}-\overline{\mathbf{y}}, \widehat{\mathbf{y}}-\overline{\mathbf{y}}))$$
$F$ Statistics
The $F$-statistic for nested models is given by $$ F=\frac{\frac{\mathrm{RSS}_{\text {Full }}-\mathrm{RSS}_{\text {Reduced }}}{\mathrm{df}_{\text {Reduced }}-\mathrm{df}_{\text {Full }}}}{\frac{\mathrm{RSS}_{\text {Full }}}{\mathrm{df}_{\text {Full }}}} \propto \cot ^{2}(\angle\left(\widehat{Y}_{\mathrm{F}}-\bar{Y}, \widehat{Y}_{\mathrm{R}}-\bar{Y}\right)) $$
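The cosine identities above can be verified numerically; a sketch reusing y, X and y_hat from the earlier sketches (the helper function is just for illustration):

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

y_centered = y - y.mean()
x0_centered = X[:, 0] - X[:, 0].mean()

corr = cosine(y_centered, x0_centered)                 # matches np.corrcoef(y, X[:, 0])[0, 1]
r_squared = cosine(y_centered, y_hat - y.mean()) ** 2  # cos^2 of the angle between y - ybar and yhat - ybar
print(corr, r_squared)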

Analytic Approaches

Normal Equation

The loss function can be expanded as $$\mathrm{SSE}(b)=b^TX^TXb-2(Xb)^T\mathbf{y}+\mathbf{y}^T\mathbf{y} $$ Taking the derivative (gradient) with respect to $b$ gives $$\nabla \mathrm{SSE} = 2X^TX b - 2X^T \mathbf{y}$$ At a critical point we have $$\nabla \mathrm{SSE} = 2X^TX b - 2X^T \mathbf{y}=0,$$ which gives us the Normal Equation $$X^TX b= X^T \mathbf{y}$$ When the inverse of $X^TX$ exists, we again have $$ b = ( X^T X)^{-1}X^T \mathbf{y}$$
Component-wise, the gradient of the quadratic term is $$ \begin{aligned} \left(\nabla\, b^TX^TXb\right)_i &= \lim_{\epsilon \to 0} \frac{(b+\epsilon e_i)^TX^TX(b+\epsilon e_i)-b^TX^TXb}{\epsilon} \\ &= \lim_{\epsilon \to 0} \frac{b^TX^TXb+2\epsilon\, b^TX^TXe_i+\epsilon^2 e_i^TX^TXe_i-b^TX^TXb}{\epsilon} \\ &= \lim_{\epsilon \to 0} \frac{2\epsilon\, b^TX^TXe_i+\epsilon^2 e_i^TX^TXe_i}{\epsilon} \\ &= 2\,b^TX^TXe_i = \left(2X^TXb\right)_i \end{aligned} $$
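A finite-difference check of the gradient formula $\nabla \mathrm{SSE}(b)=2X^TXb-2X^T\mathbf{y}$, reusing the X and y defined earlier:

def sse(b):
    r = X @ b - y
    return r @ r

b0 = rng.normal(size=X.shape[1])                # arbitrary point at which to check the gradient
grad_analytic = 2 * X.T @ X @ b0 - 2 * X.T @ y

eps = 1e-6
grad_numeric = np.array([
    (sse(b0 + eps * e_i) - sse(b0 - eps * e_i)) / (2 * eps)   # central difference along e_i
    for e_i in np.eye(X.shape[1])
])
print(np.allclose(grad_analytic, grad_numeric, atol=1e-3))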

Gradient Descent

For the ideas behind the Gradient Descent algorithm, see the page Optimization Methods. In this example, we apply Gradient Descent to minimize $\mathrm{SSE}(b)$. The updating rule is given by $$b_{j+1}=b_{j}-\alpha X^{T}(Xb_j-\mathbf{y})$$ where the factor of $2$ from the gradient has been absorbed into the learning rate $\alpha$.
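A minimal sketch of this iteration on the synthetic X and y from before (the learning rate and iteration count below are illustrative choices that happen to work for this example):

alpha = 0.01                  # learning rate (absorbs the factor of 2 from the gradient)
b = np.zeros(X.shape[1])
for _ in range(5000):
    b = b - alpha * X.T @ (X @ b - y)          # gradient step
print(np.allclose(b, np.linalg.solve(X.T @ X, X.T @ y), atol=1e-6))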

The Hessian of $\mathrm{SSE}$ is given by $$H_\mathrm{SSE}(b)=J_{\nabla \mathrm{SSE}}( b )=2X^TX$$ When $X$ has full column rank, $H_\mathrm{SSE} (b)$ is positive definite. Locally, by the Derivative Test, any critical point of $\mathrm{SSE}(b)$ is then a local minimum; and globally, $H_\mathrm{SSE} (b)$ being positive definite for all $b$ implies $\mathrm{SSE}(b)$ is strictly convex, so the local minimum is the unique global minimum.

Parametric Approach

Linear Model Assumptions

When one regards a Linear Regression problem as the problem of fitting the least squares line to the data points, it is just a mathematical problem. But when one tries to do Statistical Inference, it becomes a statistical problem, and assumptions are needed for meaningful interpretations.

A Linear Model in Regression is a model of the form $$\mathbf{y}=X b+\mathcal{E} $$ where the error term is a random vector with distribution $\mathcal{E}\sim \mathcal{N}(0,\sigma^2 I)$.

Maximum Likelihood

Let's take a small motivating quiz: suppose $\mathbf{x}$ is a Gaussian random variable with mean $1$, and we observe $4$ realizations of $\mathbf{x}$. Among several candidate distributions for $\mathbf{x}$, which one makes the observed realizations most likely?

The idea of Maximum Likelihood essentially formalizes the above example. Let $V$ be a random vector distributed according to a density $$ p(V|\theta)$$ depending on a parameter $\theta$. Given a realization $v$ of $V$, the likelihood function is defined as the following function of $\theta$, $$L(\theta): = p(v|\theta)$$ and the maximum likelihood estimator for $\theta$ is given by $$ \operatorname{argmax}_\theta\ L(\theta) $$ Since the $\ln$ function is monotonically increasing and turns products into sums, it is often easier to work with the log-likelihood function $$l(\theta)=\ln L(\theta)$$ and the corresponding optimization problem $$ \operatorname{argmax}_\theta\ l(\theta) $$

In the linear regression model, it turns out that the MLE for $b$ does not depend on $\sigma$; thus for simplicity we will assume $$\mathbf{y}=X b+\mathcal{E}\sim \mathcal{N}(Xb,I) $$ With $\mathbf{y}$ and $X$ fixed, the likelihood function for $b$ is given by $$L(b)=(2 \pi)^{-\frac{n}{2}} e^{-\frac{1}{2}(\mathbf{y} -Xb)^{T}(\mathbf{y} -Xb)}, $$ and the log-likelihood is given by $$l(b)=\text{constant}-\frac{1}{2}(\mathbf{y} -Xb)^{T}(\mathbf{y} -Xb) $$ Note that from here we see that the maximization problem for the MLE is equivalent to the minimization problem for $\mathrm{SSE}(b)$.
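To see this equivalence numerically, one can minimize the negative log-likelihood directly with a generic optimizer and compare the result to the closed-form least squares solution; a sketch assuming scipy is available and reusing the earlier X and y:

from scipy.optimize import minimize

def neg_log_likelihood(b):
    r = y - X @ b
    return 0.5 * r @ r            # -l(b) up to an additive constant

mle = minimize(neg_log_likelihood, x0=np.zeros(X.shape[1]))
ols = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(mle.x, ols, atol=1e-4))   # MLE coincides with the least squares solution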

Both the Derivative Test and Gradient Descent/Ascent can be used to find or estimate the maximum likelihood estimator, but let us use this opportunity to apply Newton's Method.

Newton's Method

The idea behind Newton's Method is discussed on the page Optimization Methods. We go straight to its application to maximizing the log-likelihood function. To maximize $$l(b)=\text{constant}-\frac{1}{2}(\mathbf{y} -Xb)^{T}(\mathbf{y} -Xb) $$ we apply Newton's method to its gradient $$f(b):=\nabla l(b) = -X^TX b + X^T \mathbf{y}$$ The Jacobian of $f$ is given by $$J_f(b) = H_l(b)=-X^TX$$ When $X$ has full column rank, $J_f(b)$ is negative definite; thus locally, by the Derivative Test, any critical point of $l(b)$ is a local maximum, and globally, $J_f(b)$ being negative definite for all $b$ implies $l(b)$ is concave, so the local maximum is the unique global maximum.

Given any initial guess $b$, one step of Newton's Method gives $$\begin{align} \beta&=b-J_f(b)^{-1}f(b)\\ & =b+(X^TX)^{-1}(-X^TXb+X^T\mathbf{y})\\ &=(X^TX)^{-1}X^T\mathbf{y} \end{align}$$ i.e. only one step is needed. The application of Newton's Method to Linear Regression requires only one step because $f(b)$ is an affine map in $b$, i.e. it is of the form $f(b) = Mb+v$. A simple nontrivial application of Newton's Method can be found on the page Optimization Methods.
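A one-step numerical check, reusing the X and y from before:

b0 = rng.normal(size=X.shape[1])                   # arbitrary initial guess
f_b0 = -X.T @ X @ b0 + X.T @ y                     # f(b0) = gradient of l at b0
one_step = b0 - np.linalg.solve(-X.T @ X, f_b0)    # b0 - J_f(b0)^{-1} f(b0)
print(np.allclose(one_step, np.linalg.solve(X.T @ X, X.T @ y)))   # True: one step suffices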

Regularizations

Regularization is a technique used to prevent overfitting in linear regression. It adds a penalty term to the loss function, which encourages the model to have smaller coefficients. This helps keep the model from fitting the noise in the data and instead focuses it on the underlying relationship between the features and the target variable.

Ridge Regression & Lasso Regression

Lasso Regression uses an absolute value (L1) penalty to encourage coefficient sparsity: $$\text{Regularized Loss} = \text{Original Loss} + \lambda \sum_{i=1}^{p} |\beta_i|$$ The L1 penalty can shrink some coefficients exactly to zero, so Lasso also performs feature selection; it is best suited when only a subset of the features is expected to matter.
Ridge Regression adds a penalty proportional to the square (L2 norm) of the coefficients to the loss function: $$\text{Regularized Loss} = \text{Original Loss} + \lambda \sum_{i=1}^{p} \beta_i^2$$ The L2 penalty shrinks coefficients toward zero without setting them exactly to zero and retains a closed-form solution; it is best suited when many features carry some signal or when features are highly correlated.

Elastic Net Regression

Elastic Net combines the L1 and L2 penalties through a convex combination: $$\text{Regularized Loss} = \text{Original Loss} + \lambda \left( \alpha \sum_{i=1}^{p} |\beta_i| + (1 - \alpha) \sum_{i=1}^{p} \beta_i^2 \right)$$ It retains the sparsity of Lasso while inheriting the stability of Ridge, which is particularly helpful when features are correlated; in practice both $\lambda$ and the mixing weight $\alpha$ are typically chosen by cross-validation.
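For reference, here is a sketch of the three penalties using scikit-learn (assuming it is installed); note that scikit-learn's ElasticNet calls the overall strength $\lambda$ `alpha` and the mixing weight $\alpha$ above `l1_ratio`, and the penalty values below are illustrative, not tuned:

from sklearn.linear_model import Ridge, Lasso, ElasticNet

# X_raw and y are reused from the earlier synthetic example;
# scikit-learn fits the intercept itself, so the raw features are passed in
ridge = Ridge(alpha=1.0).fit(X_raw, y)
lasso = Lasso(alpha=0.1).fit(X_raw, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_raw, y)

print(ridge.coef_)   # all coefficients shrunk, none exactly zero
print(lasso.coef_)   # some coefficients may be exactly zero
print(enet.coef_)    # a compromise between the two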

Logistic Regression

log-odds

A Linear Model assumes $$y\sim \mathcal{N}(\mathbf{b}^T\mathbf{x},\sigma^2)$$ i.e. at any given point $\mathbf{x}$, it assumes $y= f(\mathbf{x})$ follows a Gaussian distribution; this prevents us from modeling the case where $y=f(\mathbf{x})$ follows some other distribution. Consider, for example, a binary Classification problem with labels $0$ and $1$. At any point $\mathbf{x}$, it is reasonable to assume that $$f(\mathbf{x})\sim \mathtt{Bernoulli}(p(\mathbf{x}))$$ and this leads us naturally to Logistic Regression.

In Logistic Regression, we assume for some parameter $\theta$, $$y\sim \mathtt{Bernoulli}(p_\theta{(\mathbf{x})})$$ Recall that the odds of $y=1$ conditioned on $\mathbf{x}$ are given by $$\frac{\mathtt{Pr}(y=1\mid \mathbf{x})}{\mathtt{Pr}(y=0\mid \mathbf{x})}= \frac{p_\theta{(\mathbf{x})}}{1-p_\theta{(\mathbf{x})}}\geq 0$$ Applying the monotonic function $\log$ to the above, we get the log-odds of $y=1$, or equivalently the logit function applied to $p_\theta{(\mathbf{x})}$, i.e. $$\mathtt{logit}(p_\theta{(\mathbf{x})}):=\log\left(\frac{p_\theta{(\mathbf{x})}}{1-p_\theta{(\mathbf{x})}}\right)\in \mathbb{R}$$ A simple additional assumption is that the log-odds is a linear function of $\mathbf{x}$, i.e. $$\log\left(\frac{p_\theta{(\mathbf{x})}}{1-p_\theta{(\mathbf{x})}}\right)=\theta\cdot \mathbf{x}$$ Thus, by the above assumptions, we have $$p_\theta{(\mathbf{x})}= \frac{1}{1+\exp(-\theta\cdot \mathbf{x})}=:\mathtt{sigmoid}(\theta\cdot \mathbf{x})$$ where the sigmoid is the inverse of the logit function.

The goal of logistic regression is to estimate the probabilities $p_\theta(\mathbf{x})$. Since the sigmoid function is not linear, it is more convenient to use MLE to estimate $\theta$. Maximizing the likelihood function is equivalent to minimizing the negative of the log-likelihood function $$L(\theta)=-\sum_{i=1}^{n} y^{(i)} \log \left(p_\theta(\mathbf{x}^{(i)})\right)+\left(1-y^{(i)}\right) \log \left(1-p_\theta(\mathbf{x}^{(i)})\right)$$ The summands are $$\begin{cases}\log \left(p_\theta(\mathbf{x}^{(i)})\right),\qquad & y^{(i)}=1\\ \log \left(1-p_\theta(\mathbf{x}^{(i)})\right), & y^{(i)}=0 \end{cases} $$ This loss function is also known as the Binary Cross Entropy. Using Gradient Descent, we get the following updating rule $$\theta_{j+1}=\theta_{j}-\alpha X^{T}(p_\theta(X)-\mathbf{y})$$ where $X$ is the design matrix and $p_\theta(X)$ denotes the vector of predicted probabilities $p_\theta(\mathbf{x}^{(i)})$.
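A minimal sketch of this updating rule on synthetic binary data (all names, the learning rate, and the iteration count are illustrative):

import numpy as np

rng_clf = np.random.default_rng(1)
n_obs = 200
X_feat = rng_clf.normal(size=(n_obs, 2))
X_design = np.column_stack([X_feat, np.ones(n_obs)])   # design matrix with intercept column
theta_true = np.array([1.5, -2.0, 0.5])
y_bin = rng_clf.binomial(1, 1.0 / (1.0 + np.exp(-X_design @ theta_true)))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

alpha = 0.01
theta = np.zeros(X_design.shape[1])
for _ in range(10000):
    grad = X_design.T @ (sigmoid(X_design @ theta) - y_bin)   # gradient of the binary cross entropy
    theta = theta - alpha * grad
print(theta)        # should be roughly comparable to theta_true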

Model Coefficients

The logistic regression model can be expressed as:

$$ \log\left(\frac{P(y=1)}{1-P(y=1)}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n $$ Taking the exponential of both sides, we get: $$ \begin{align} \text{odds}(y=1) &= \exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n)\\ &= \exp(\beta_0)\exp(\beta_1 x_1)\exp(\beta_2 x_2)\cdots\exp(\beta_n x_n) \end{align} $$ The exponential of a coefficient, $$\exp(\beta_j)$$ gives the multiplicative change in the odds of $y=1$ for a one-unit increase in the corresponding feature $x_j$, holding the other features fixed.
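In practice the exponentiated coefficients (odds ratios) can be read off directly from a fitted model; a sketch with statsmodels' Logit, reusing X_design and y_bin from the gradient descent sketch above:

import numpy as np
import statsmodels.api as sm

logit_results = sm.Logit(y_bin, X_design).fit()   # X_design already contains an intercept column
odds_ratios = np.exp(logit_results.params)        # exp(beta_j): multiplicative effect on the odds of y=1
print(odds_ratios)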

Generalized Linear Models

A Generalized Linear Model is of the form $$\eta\left(\mathrm{E}\left(Y \mid \bf x\right)\right)= \theta \cdot \bf x',\qquad \bf x' = \begin{pmatrix}1\\ \bf x \end{pmatrix} $$ for some link function $\eta$, and we assume that the entries of $Y$ are independently sampled from an Exponential Family.

Examples of GLM

  1. The Linear Model is a GLM, where the link function is the identity function, and $Y$ is from the Gaussian Distribution (an exponential family).
  2. The Logistic Regression Model is a GLM, where the link function is the logit function $$\eta(p)=\log \left(\frac{p}{1-p}\right)$$ and $Y$ is from the Bernoulli Distribution (another example of an exponential family).
  3. If we let $y$ take on the following PMF $$\mathtt{Pr}(y=k \mid\mathbf{x})= \begin{cases}\frac{\exp(\beta_k \cdot \bf x)}{1+\sum_{l=1}^{K-1}\exp(\beta_l\cdot \mathbf{x})} \quad & k = 1,2,\cdots,K-1 \\ \frac{1}{1+\sum_{l=1}^{K-1}\exp(\beta_l\cdot \bf x)} & k=K \end{cases} $$ and $$\eta(p) = \log(\frac{p}{\mathtt{Pr}(y=K| \mathbf{x})}) $$ we get the Multinomial Logistic Regression.
  4. If we let $Y\sim \mathtt{Poisson}(\lambda)$, and take $\eta = \log$, then we get the Poisson Regression Model. $$ \lambda:=\mathrm{E}(Y \mid x)=e^{\theta^{\prime} x} $$ $$ p(y \mid x ; \theta)=\frac{\lambda^y}{y !} e^{-\lambda}=\frac{e^{y \theta^{\prime} x} e^{-e^{\theta^{\prime} x}}}{y !} $$
    1. Example: modeling default rates, $$\log(\lambda) = \beta_0 + \beta_1 (\text{unemployment rate}) + \beta_2 (\text{loan-to-income ratio}) + \log(\text{exposure})$$ where $\lambda$ is the expected default count and $\text{exposure}$ accounts for loan volume (a fitted sketch with statsmodels follows this list).
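A sketch of the default-rate style Poisson regression above, using statsmodels' GLM with a Poisson family and a log-exposure offset; the data below is synthetic and the column names are illustrative:

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng_glm = np.random.default_rng(2)
df = pd.DataFrame({
    "unemployment": rng_glm.uniform(3, 10, size=500),
    "loan_to_income": rng_glm.uniform(0.5, 3.0, size=500),
    "exposure": rng_glm.integers(100, 1000, size=500),
})
rate = np.exp(-4 + 0.2 * df["unemployment"] + 0.5 * df["loan_to_income"])   # default rate per unit exposure
df["defaults"] = rng_glm.poisson(rate * df["exposure"])

X_glm = sm.add_constant(df[["unemployment", "loan_to_income"]])
poisson_results = sm.GLM(
    df["defaults"], X_glm,
    family=sm.families.Poisson(),
    offset=np.log(df["exposure"]),   # log(exposure) enters with its coefficient fixed at 1
).fit()
print(poisson_results.params)        # should be close to (-4, 0.2, 0.5)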

Mixed Effect Models

A Linear Mixed Effect Model is of the form $$\mathbf{y} = X\beta + U\gamma + \mathbf{\epsilon}$$ Where
  • $X,U$ are fixed design matrices
  • $\beta$ is an unknown fixed vector
  • $\gamma\sim \mathcal{N}(0,G)$, $\epsilon\sim \mathcal{N}(0,R)$
If $U$ is the zero matrix, then the model reduces to an ordinary Linear Model (in particular, a GLM). The following example fits a random-intercept model to the dietox data with statsmodels.

import statsmodels.api as sm
import statsmodels.formula.api as smf

# Example data: pig weights measured over time (dietox data set from R's geepack)
data = sm.datasets.get_rdataset("dietox", "geepack").data

# Fit a mixed model with a fixed effect for Time and a random intercept per Pig
model = smf.mixedlm("Weight ~ Time", data, groups=data["Pig"])
results = model.fit()

# Predictions using the fixed effects only
fixed_predictions = results.predict(data)

# Add the estimated random effects back in:
# results.random_effects maps each group label to a Series of that group's
# estimated random effects (here a single random intercept per Pig)
random_effects = results.random_effects
random_intercepts = data["Pig"].map(lambda g: random_effects[g].iloc[0])
group_predictions = fixed_predictions + random_intercepts

Neural Networks

Logistic Regression (resp. the Linear Model) can be viewed as a single-layer neural network with a sigmoid (resp. identity) activation function and a binary cross entropy (resp. squared error) loss function. We will discuss neural networks in more detail on a later page, Neural Networks.